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Abstract 

An additive model-assisted nonparametric method is investigated to estimate 
the finite population totals of massive survey data with the aid of auxiliary in- 
formation. A class of estimators is proposed to improve the precision of the well 
known Horvitz-Thompson estimators by combining the spline and local poly- 
nomial smoothing methods. These estimators are calibrated, asymptotically 
design-unbiased, consistent, normal and robust in the sense of asymptotically 
attaining the Godambe-Joshi lower bound to the anticipated variance. A con- 
sistent model selection procedure is further developed to select the significant 
auxiliary variables. The proposed method is sufficiently fast to analyze large 
survey data of high dimension within seconds. The performance of the proposed 
method is assessed empirically via simulation studies. 

Keywords: Calibration, Horvitz-Thompson estimator, local linear regression, 
model-assisted estimation, spline, superpopulation. 
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1. Introduction 

Auxiliary information is often available in many surveys for all elements 
of the population of interest. For instance, in many countries, administrative 
registers provide extensive sources of auxiliary information. Complete registers 
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can give access to variables such as sex, age, income and country of birth. Studies 
of labor force characteristics or household expenditure patterns, for example, 
might benefit from these auxiliary data. Another example is the satellite images 
or GPS data used in spatial sampling. These data are often collected at the 
population level, which are often available at little or no extra cost, especially 
compared to the cost of collecting the survey data. 

If no information other than the inclusion probabilities is used to estimate 
the population total, a well-known design unbiased estimator is the Horvitz- 
Thompson (HT) estimator. Nowadays, "cheap" auxiliary information can be 
regularly used to obtain higher precision estimates for the unknown finite pop- 
ulation quantities. For instance, post-stratification, calibration and regression 
estimation are different design-based approaches used to improve the precision 
of estimators. Auxiliary information can also be used to increase the accuracy of 
the finite population distribution function; see, for example, [soj. Model-assisted 
estimation (|2l|) provides a convenient way to incorporate auxiliary variables to 
develop more efficient survey estimators. By model-assisted, it is meant that a 
superpopulation model is adopted (for example, model ([1]) below), in which the 
finite population is modeled conditionally on the auxiliary information; see, for 
instance, |i|6|,i,[9|. 

The traditional parametric model-assisted approach assumes that the su- 
perpopulation model is fully described by a finite set of parameters, e.g., the 



regression estimator introduced in [211]. However, survey data now being col- 



lected by many government, health and social science organizations have more 
complex design features. It is difficult to obtain any prior model information 
to address various hypotheses. In this sense, preselected parametric model is 
too restricted to fit unexpected features. In contrast, nonparametric regression 
provides a useful tool for studying the dependence of variables of interest on 
auxiliary information without constraining the dependence to a fixed form with 
few parameters. The fiexibility of nonparametric regression is extremely helpful 
to capture the complicated relationship between variables as well as in obtaining 
robust predictions; see [iQl12| for details. 

Breidt and Opsomer Ij first proposed a nonparametric model-assisted es- 
timator based on local polynomial regression, which generalizes the parametric 
framework in survey sampling and improves the precision of the survey esti- 
mators immensely. Their investigation is only based on one auxiliary variable. 
Most surveys, however, involve more than one study variables, perhaps, many; 



see 



22j. For example, the remote sensing data which provide a wide and grow- 



ing range of variables to be employed. In this context, when the dimension 



2 



of the auxiliary information vector is high, one unavoidable issue is the "curse 
of dimensionality" , which refers to the poor convergence rate of nonparametric 
estimation of general multivariate functions. One solution is regression in the 



form of additive model; see 13 



Estimation and inference for additive models have been well studied in 



the literature; see, for example, the classic backfitting estimators of [13|, the 



marginal integration estimators of |l6|, th e smoothing backfitting estimators of 



17l |. the spline estimators of Stone ([25|, |26|) and the spline-backfitted kernel 
estimators of j29^. In survey sampling context, jij discussed a semiparamet- 
ric possible extension to multiple auxiliary variables via using the penalized 
splines; 19| applied the generalized additive models (GAMs) in an interaction 
model for the estimation of variables from forest inventory and analysis surveys; 
and jif proposed a special case of the GAMs with an identity link function. 
For large and high dimensional survey data, it is important that estimation 
and inference methods are efficient and computationally easily implemented. 
However, few methods are theoretically justified and computational efficient 
when there are multiple nonparametric terms. The kernel based backfitting and 
marginal integration approaches are computationally expensive, limiting their 
use for high dimensional data; see 18|] for some numerical comparisons of these 
methods. Spline methods, on the other hand, provide only convergence rates 
but no asymptotic distributions, so no measures of confidence can be assigned 
to the estimators. 

Challenged by these demands, we propose approximating the nonparametric 
components by using the spline-backfitted local polynomial: spline does a quick 
initial estimation of all additive components and removes them all except the 
ones of interest; kernel smoothing is then applied to the cleaned univariate data 
to estimate with the asymptotic distribution. This two-step estimator is both 
computationally expedient for analyzing large and high dimensional survey data, 
and theoretically reliable as the estimator is uniformly oracle with asymptotic 
confidence intervals. The resulting estimator of population total can therefore 
be easily calculated, and more importantly allow for formal derivation of the 
asymptotic properties of the estimator. 

In practice, a large number of variables may be collected and some of the 
insignificant ones should be excluded from the final model in order to enhance 
the predictability. The selection of auxiliary variables is a fundamental issue for 
model-assisted survey sampling methods. In this paper, we propose a consistent 
variable selection method for the additive model-assisted survey sampling based 
on the Bayes information criterion (BIG). A comprehensive Monte Garlo study 
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demonstrates superior performance of the proposed methods. 

The rest of the paper is organized as follows. Section 2 gives details of the 
superpopulation model and proposed method of estimation. Section 3 describes 
the weighting, calibration and asymptotic properties of the proposed estimator. 
Section 4 describes the auxiliary variable selection procedure for the superpop- 
ulation model under simple random sampling design (SRS). Section 5 reports 
the findings in an extensive simulation study. Lengthy technical arguments are 
given in the Appendix. 

2. Superpopulation Model and Proposed Estimator 

In what follows, let Un = {1, ■■■,N} be the finite population of ele- 
ments, called the target population, and i represents the ith element of the pop- 
ulation. Let Xj = {xii, ...,Xid} be a (i- dimensional auxiliary variable vector, i G 
Un. We are interested in the estimation of the population total ty — "^i^uM 
where yi is the value of the study variable, y, for the ith element. To this end, 
a sample s of size un is drawn from Un according to a fixed sampling design 
Pn {'), where Pn (s) is the probability of drawing the sample s. The inclusion 
probabilities, known for all i G Un, are TTj^v = TTj = Pr {? G s} = J^s^iP^ (^)- 
In addition to the Hi, denote 7iijN = T^ij = Pi^ihj G s} = J^sBijP^is) the 
inclusion probability for both elements i,j E Un- 

Let {(xj, yi)}i^u^ be a realization of (X, Y) from an infinite superpopulation, 
^, satisfying 

y = m(X) + (T(X)£, (1) 
in which the unknown (i-variate function m has a simpler form of 

d 

m(X) = c + ^m„(X„), E5K(X„)] = 0, 1 < a < d, (2) 

a=l 

the function a{-) is the unknown standard deviation function and the standard 
error e satisfies that £'^(e|X) = and £'^(e^|X) = 1. In the following, we 
assume the auxiliary variable is distributed on a compact interval [a^ , ba] , 
a = l,...,d. Without loss of generality, we take all intervals [aQ,,&a] = [0,1]. 
To estimate the additive components in ([2]), we employ a two-stage procedure 
based on the spline-backfitted local polynomial smoothing. 

For any a = l,...,d, we introduce a knot sequence with J interior knots 
koa = < kia < ■■■ < kja < 1 = where J = Jn increases when 
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rijv increases, and the precise order is given in Assumption (A5). Denote the 
piecewise hnear truncated power sphne basis 

r(x) = {l,Xa,{Xa - kia)+, - ■ ■ ,{Xa " ^Ja)+ , « = 1,...,^}'^, (3) 

where (a)^ = a if a > and otherwise. For the local linear smoothing, let 
Kh{x) = h~^K{x/h), where K denotes a kernel function and h = hj^ is the 
bandwidth; see Assumption (A6) below. 

We now describe our two-stage estimator for the population total ty. At the 
first stage, we apply the spline smoothing to obtain a quick initial estimator of 
m(xj), 

d d J 

m (Xj) = 60 + ^ bo^aXia + X] XI ^^'"^ ~ ' 
a=l a=l j=l 



where bo and bj^a, j = 0,1, .., J, a = 1, ...,d are the minimizes of the following 

{d d J 2 

Vi-bo-^ bo,aXa - X] X] ~ f 

a=l a=l j=l ) 

over a Grf = 1 + ( J + l)d dimensional vector. Because the components nia {xa) 
can only be identified up to an additive constants, we center the estimator of 
nia {xa) and define the centered pilot estimator of the ath component as 

J 

{Xa) bo^aXa ~\~ ^ ^ bj^a {Xa Cq, (5) 

where = iV"^ Y^ias '^i^ ^Q,aXia + Y.'l=i bj,a {xia - + The above pilot 
estimators in (|5]) are then used to construct the new pseudo-responses 

yia = Vi- N~^iy-'^rhp{xio), i e s, a = l,...,d, (6) 

where ty is the well-known HT estimator. 

At the second stage, a local polynomial smoothing is applied to the cleaned 
univariate data {xia,yia}i^s ^'^ achieve the "oracle" property in [29|. To be 
specific, considering the local linear smoothing, for any a = 1, d, we minimize 

TT.r^ {ilia - ao,a - tti^a (Xia - x) Kh {Xia - x)Y , (7) 



with respect to ao,a and ai^a- The sphne-backfitted local linear (SBLL) estimator 
of the a-th component rria is m* = ao,a in dZD- The final sample design-based 
SBLL estimator of m (x) is defined as 

1 

m*(x) = -t; + ^m;(a:„). (8) 

a=l 

Substituting m* = m* (xj) into the existing generalized difference estimator 
(see page 221 of |2l|), the SBLL estimator for ty is defined by 

t;.e.. = E *• + E ^ = E 7 + E (i - 1) **. (9) 

where /j = 1 if i G s and = otherwise. 

Remark 1. In the first step spline smoothing, the number of knots J^r can be 
determined hy and a tuning constant c: 

= min ([cnj,^ log(n^)] + 1, [(n^/2 -l)/d- 1]) . (10) 



As discussed in [29|, the choice of c makes little difference. In the second step 
local polynomial smoothing, one can use the quartic kernel and the rule-of-thumb 
bandwidth. 

3. Properties of the Estimator 

3.1. Weighting and Calibration 

In the last decade, calibration estimation has developed into an important 



field of research in survey sampling. As discussed in and 15 1, calibration is 
a highly desirable property for survey weights, which allows the survey practi- 
tioner to simply adjust the original design weights to incorporate the information 
of the auxiliary variables. Several national statistical agencies have developed 
software to compute calibrated weights based on auxiliary information available 
in population registers and other sources. The proposed SBLL estimator in this 
paper also shares this property in certain sense. 

Let ys be the column vector of the response values Ui for i e s and define the 
diagonal matrix of inverse inclusion probabilities IT,, = diag {l/vTjj.g^. For r(x) 
in ([3]), denote = {F(xj)^}.^^ the sample truncated power spline matrix. 
Let be the collection of the estimated spline coefficient in then = 
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(rjll^rs) ^ rjllsys. Thus the pilot sphne estimator of rria {xa) in ([5]) can be 
written as 

ma (xa) = {r (x)^ D,B, - N-'lln,T,BaB,} y„ (11) 
where 1„ is a vector of length with all "l"s, and 

= diag{0,...,0, 1^^, 0,...,0} (12) 

from (J+l)(o-l)+2 to (J+l)a+l 

is a. Gd X Gd diagonal matrix. Denoting the spline smoothing matrix and its 
centered version by 

* =rD fr^nr)~^r^n =fi-A^^ii^nW 

^ sa s'^ a y- s s j s si ^ sa \ -'-n-'-n sj ^ sai 

we have rho, = {rha (xiQ,)}-^^ = ^*^ys, for a = 1, d. Further for yia in (Q, let 
fa = {yia}i(.s = ys- j^iy'^n " ^1^^ ^ud define the matrices 

-^sia 1 ( Xka Xia ) /i.^ ; sia diag \ Kji (x^q, 3^ia) ( 

Then the SBLL estimator of rria at be written as 

^ia ~ ^a (.-^ia) i.'^sia^'^ sia'^sia') '^sia^'^ siaYai (^3) 

where ei = (1, 0)^. Therefore, the SBLL estimator in ([8]) of m (x) at Xj is 
1 ft 



o=l \ P^a 



where 

Psi ~ ^1 |X]o=l O^sia^ sia'^sia) "^sia^ sia + 'dJ^'^'n-'^rJ^s ~ YljSjta ^ s/sj | • 

Similar to jiof, we define the "g- weight" 



(14) 



where aj is a rijv-dimensional vector with a "1" in the zth position and "0" 
elsewhere. Thus the proposed estimator t^^sBLL in ([8]) can be written as 



i&s j&Ufq ■' ' i€s 
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which is a hnear combination of the sample i/j's with a samphng weight, 7i^^, 
and the "g-weight". Because the weights are independent of iji, they can be 
apphed to any study variable of interest. 

As we show below, the weight system gives our estimator of the known total 

Theorem 1. For any a = 1, ...,d and the "g-weight" defined in [T4\ ), 
Proof. Let = {xia}i^g. We have 

. 7=1 \ /3^7 



Observe that 



(rjn.r,)-' rjn.x, = t„, *,^x, = r.D^r, 



Xq,, for /3 = a 
0, for /3 7^ a 



where Tq, is the vector of dimension Gd with a "1" in the {2 + ( J + 1) (a — l)}th 
position and "0" elsewhere. Then we have. 



, 1^1 -V**^x - I (i + ^i"inns)x„, for 7 

+ dN -""-""''^ ^^/^ I - \ ^l„l^n,x„, for 7 



for 7 7^ a 



/37^7 

Note that for any i G Ujy, 

eT (X^- W ■ X • I'^X'^- W ■ X = r- e?' fX'^- W ■ X • X'^- W ■ 1 =1 
thus 

= e^ (XJ,„W,,,X,,J-^ XJ,„W,,„ {I + (rfiV)-i(l - d)Kiin,} x„ 
+r/~^e?" fX^. W • X • "l""^ X^. W • 1 I'^n X = r- 

Hence the proposed SBLL estimator defined in 1^ preserves the calibration 
property. □ 
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3.2. Assumptions 

For the asymptotic properties of the estimators, we adopt the traditional 
asymptotic framework in [l[ where both the population and sample sizes increase 
as N oo. There are two sources of "variation" to be considered here. The 
first is introduced by the random sample design and the corresponding measure 
is denoted by p. The "Op", "Op" and ''Ep{-)" notation below is with respect to 
this measure. The second is associated with the superpopulation from which 
the finite population is viewed as a sample. The corresponding measure and 
notation are "^". For simplicity, let TCij — Hi-Kj = Aij. 

(Al) The density / (x) of X is continuous and bounded away from and oo. 

The marginal densities fa (xa) of have continuous derivatives and are 

hounded away from and oo. 
(A2) The second order derivative of continuous, V a = 1, d. 

(A3) There exists a positive constant M such that |X j < M for some 

5 > 1/2; cr (x) is continuous on [0, 1]'^ and hounded away from and oo. 
(A4) yls — oo, T^AT — oo and njyN^^ — > tt < 1. 
(A5) The numher of knots Jn ~ n^ii^ log(niv). 

(A6) The kernel function K is Lipschitz continuous, hounded, nonnegative, 
symmetric, and supported on [—1,1]. The handwidth ~ n^^^., i.e., 
Chn~^^^ < < ChU^^^ for some positive constants Ch, Ch- 

(A7) For all N, minjg[/^ > A > 0, minjjgt/^ vTjj > A* > and 

limsuprijv max \Aij\ < oo. 

7V-S.OO i,j<^UN,i¥=j 

(A8) Let Dk,N he the set of all distinct k-tuples {ii,i2, ■■■,'ik) from Un- Then 



limsupn^v max \Ep [(/j^ - tt^J {U^ - vTiJ {h^ - iTi^) {U^ - vTiJ]] < oo 

N—>-00 («l,«2,i3,«4)e-D4,]V 



lim sup max \Ep [{liji^ - vTi^iJ {lialii - '^isiM < 

N^OO («l,«2,l3,M)e£'4_]v 



lim sup max \Ep [{li^ - vTiJ (Jj^ - VTiJ (/jg - VTig)] | < oo. 

Af-S>00 («1,«2,«3)G-D3_jv 

Remark 2. Assumptions (Al)-(A3) are typical in the smoothing literature; see, 
for instance, 

0,0, Assumption (A5) is about how to choose the number 
of interior knots Jn for the spline estimation in the first stage. In practice, Jn 



9 



can be determined by ffTOl) . Assumption (A6) is how to select the kernel function 
and the corresponding bandwidth. Such assumptions were used in ji^ in the 
additive autoregressive model fitting. Assumptions (A7) and (A8^ involve the 
inclusion probabilities of the design, which were also assumed 



3.3. Asymptotic properties of the estimator 

Like the local polynomial estimators in the following theorem shows 
that the estimator ty^sBLL in dH]) is asymptotically design unbiased and design 
consistent. 

Theorem 2. Under Assumptions (A1)-(A7), the estimator iy^sBhi. in ^ is 
asymptotically design unbiased in the sense that 



lim E„ 



ty,SBLL ty 



with ^-probability 1, 



and is design consistent in the sense that for all rj > 0, 

with ^ -probability 1. 



lim Eri 



^{\iy,SBLL-ty\>Nri} 



Let ty,sBLL be the population-based generalized difference estimator of ty 
when the entire realization were known; see (IA.4P in Appendix A.l for the 
formal definition. Like the estimators in the local polynomial estimators in [H, 
the penalized spline estimators in [2!], and the backfitting estimators in jsj, the 
following theorem shows that the proposed estimator ty,sBLL also inherits the 
limiting distribution of the "oracle" estimator t^^sBLL- 



Theorem 3. Under Assumptions (A1)-(A8), 

N ^ (^i/,SBLL ~ ty) d 

VarJ/^ (iV-^tj/,sBLL) ~" 

^ ('^j/,SBLL ty) d 



AT (0,1) 



as N ^ 00 implies 



N{0,1] 



where 



V (iV-w) ^"i^'^l^. (15) 
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The next theorem proves that ^j^^sbll is robust as in [Ij and it also asymp- 
totically attains the Godambe-Joshi lower bound to the anticipated variance 

Var [iV-i (t; - ty)] = E [N'^ (t, - ty)]' - [N'' (£, - t,)] , 

where the expectation is taken over both design, p^, and population ^ in ([T]). 

Theorem 4. Under Assumptions (A1)-(A8), iy^^^^^^ asymptotically attains the 
Godambe-Joshi lower bound, in the sense that 

^ 2 

The proofs of Theorems [2}|3] are given in the Appendix. 

4. Auxiliary Variable Selection 

In this section, we propose a BIC-based method to select the auxiliary vari- 
ables for use in the superpopulation model ([2]). 



The BIG was first proposed in [231] for the selection of parametric models. 
Recently, 13] proposed a fast and consistent model selection method based on 
spline estimation with the BIG to select significant lags in non-linear additive 



autoregression. Analogous to the approach in [1^, if the entire realization were 
known by "oracle", one can select significant auxiliary variables based on the 
BIG. For an index set of variables r G {1, ...,d}, the BIG is defined as 

BIGM = log {aMSEW {N-Hy,s^,,) I + ^ log(n^), (16) 

where J'r = 1 + J2a&r('^N + 1), and AMSE {N^^iy^sBLL) is the asymptotic mean 
squared error (AMSE) of N~^iy ^^BLh in flA.lip . i.e. the asymptotic expectation 

of {AT-i (4,SBLL-tj;)}'. 

Next let / = n7v/A^ be the fixed sampling fraction. Under simple random 
samphng (SRS) design, if cr^(x) = c'^x. 



AMSE (iv-4;,,B,o = ^ L^ix E (y^ - 



Thus, using similar arguments in Section 5 of [IJ], we can show that the above 
BIG in (fT6l) is consistent under SRS. 
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By Theorem IA.2t AMSE (A ^^-sbll) can be estimated consistently by 

iV^ — ^ 7r,-i TT,- TT,- 

a modified version of (fT5l) proposed by [20] witli tlie "g-weiglit" in ([HI). So tlie 
sample-based BIG is defined as 



BICM = log|i/M| + ^log(n^) 



and we select the subsect f C {1, c?} that gives the smallest BIG value. 

Remark 3. Under SRS design, the variance estimator given in f|T7|) can be 
simplified as 

% = 7 — ^—r: Y\ gl iVi - rri*f . 

UN [Un - 1) 

In practice, we first decide on a set of candidate variables to be selected. 
Since a full search through all possible subsets of variables is in general com- 
putationally too costly in actual implementation of the BIG method, we con- 
sider a forward selection procedure and a backward selection procedure. Let 
d denote the total number of candidate variables to be selected from. In the 
forward selection procedure, we pre-specify the maximal number of variables 

(imax = min jci, 2(Jm+i) } ^^^^ allowed in the model, in which [a] denotes 
the integer part of a. We start from the empty set of auxiliary variables, add 
one variable at a time to the current model, choosing between the various can- 
didate variables that have not yet been selected by minimizing BIG in fll8l) . The 
process stops when the number of variables selected reaches cimax- In the back- 
ward selection procedure, we start with a set of variables of the maximal size 
c^max, delete one variable at a time by minimizing the BIG and stop when no 
variable remains in the model. If (imax < d, we first apply the forward selection 
procedure, then we start with the maximal set of variables selected in the last 
step of the forward stage. 



5. Simulation Study 

In this section, simulations are carried out to investigate the finite-sample 
performance of ty,sBLL- For comparison we also obtained the results of four 
other estimators: the HT estimator which does not make use of the auxiliary 
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population, the linear regression (LREG) estimator in the one-step linear 
spline (LS) estimator defined by 

d 

iy,Ls = ^{Vi - mi)/7ii + ^ rhi, rhi = N~Hy + ^m^a 

with rhia = mn{xia) given in (jS]), and the single- index model-assisted (SIM) 
estimator in [27|. The number of knots Jn for the LS and SBLL is determined 
by (HO]). 

For the superpopulation model ([1]), the following four additive models (no 
interactions) were considered: 

2-dim linear: F = -1 + 2X3 + AX^ + aoe, 

2- dim quadratic: F = 5.5 - 6X2 + 8(X2 - .5)^ - 3Xio + 32(Xio - .5)^ + aoe, 

3- dim mixed: Y = 8(X2 - .5)^ + exp (2X5 - 1) + sin {27r(X8 - .5)} + aos, 
5-dim sinusoid: Y = 2 + sin {27r(X„ - .5)} + f (ELi ^aY^^e, d = 5. 

The auxiliary variable vectors Xj, i G Un, were generated from i.i.d. uniform 
(0, 1) random vectors. The errors e were generated from i.i.d. X (0, 1) with 
noise level cxo = 0.1, 0.4. The population size was X = 1000. SRS Samples 
were generated of size = 50, 100 and 200. For each combination of noise 
level and sample size, 1000 replicated SRS samples were selected from the same 
population, the estimators were calculated, and the design bias and the design 
mean squared errors were computed empirically. 

Table [1] shows the ratios of the mean squared error (MSE) for the various 
estimators to the proposed SBLL estimators. From the table, one sees that 
the model-assisted estimators, LREG, LS, SIM and SBLL, perform much better 
than the simple HT regardless the type of mean function, standard error and 
sample size. For Model 1, LREG is expected to be the preferred estimator, since 
the assumed model is correctly specified. However, not much efficiency is lost 
by using SBLL instead of LREG and the MSE ratios of LREG to SBLL are 
at least 0.89 for all cases. For all other scenarios, SBLL performs consistently 
better than LREG. The SBLL estimators also improve upon the LS estimators 
across almost every combination of noise level and sample size, which implies 
that our second local linear smoothing step is not redundant. 

To see how fast the computation is. Table [1] also provides the average time 
of generating one sample of size ra^r and obtaining the SBLL estimator on an 
ordinary PC with Intel Pentium IV 1.86 GHz processor and 1.0 GB RAM. It 
shows that the proposed SBLL estimation is extremely fast. For instance, for 
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Model 4, the SBLL estimation of a 5-dimensional of size 200 takes on average 
merely 0.2 second. We also carried out simulations for high dimensional data 
with sample size rijv = 1000 generated from the population of size 10000. Re- 
markably, it takes on average less than 60 seconds to get the SBLL estimator 
even when the dimension reaches 50. 

In Table [2] we give the Monte Carlo bias and standard error of the SBLL 
estimator based on its sampling distribution over 1000 replications. Table |2] 
also show the square root of the average estimated variance of the population 
total ( !T5|) . We see that the biases of the SBLL estimator are very small and the 
variance estimator appears to perform well for medium sample size. 

Next we conducted simulations to evaluate the performance of the variable 
selection method. We generated 100 replications for each of the above models. 
The variables were searched from {1,2, ...,10} for all methods and we set the 
maximum number of variables allowed in the model to be 10. Table [3] shows 
the number of correct fit (C), underfit (U) and overfit (O) based on the BIG in 
(fTSjl over 100 simulation runs. Here underfitting means that the method misses 
at least one of the significant variables. From Table [3l we can see that both the 
forward and the backward selection procedures perform very well for moderately 
large sample size. We also obtained the ratio of MSE of the SBLL estimates 
calculated by using the selected model to the MSE of the oracle SBLL estimates 
computed by using the true model. In all the cases, the ratios are very close to 
1 or exactly 1 for moderately large sample size. 



6. Discussion 

Nonparametric additive methods enhance the flexibility of the models that 
survey practitioners use. However, due to the limitations in either interpretabil- 
ity, computational complexity or theoretical reliability, these models have not 
been widely used as general tools in survey data analysis. In this paper, we have 
advanced additive models as flexible, computationally efficient and theoretically 
attractive tools for studying survey data. We also developed a consistent proce- 
dure to select the significant auxiliary variables under simple random sampling 
design. 

The proposed method in this paper is appropriate only for survey data that 
follow simple additive model. The limitation of the basic additive model is that 
the interactions between the input features are not considered. There are other 



models, for instance, single-index model [27j, additive model with second-order 



interaction terms [2J], which reduce dimensionality but also incorporate inter- 



actions. Additive partially linear model [11| is another parsimonious candidate 
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when one believes that the relationship between the study variable and some 
of the auxiliary variables has a parametric form, while the relationship between 
the study variable and the remaining auxiliary covariates may not be linear. 
These alternative models are supposed to be more efficient in certain cases, but 
obtaining the asymptotics is likely to be very complicated, thus we leave it as 
future research work. 

Finally, in our methodology development, we have assumed that the auxil- 
iary variables are available for all population elements. It would be interesting 
to consider the limited auxiliary information case jsf where only some summary 
quantities such as means are available at the population level. This is also a 
challenging problem for future research. 

Appendix 

To show the asymptotic properties of the proposed estimator ty,sBLL, we first 
introduce an "oracle" SBLL estimator of ty if the entire realization were known. 

A.l. The Population-based Estimator 

If the entire realization were known, let Tu = (F (xj)^| be the population- 
based truncated power spline matrix, where T (x) is given in ([3]). Let y be the 
vector of the response values yi for i eUn- Further let B(/ = {TJjTu)~^ TJjy. 
The centered pilot estimators of ma (xq.) at the first stage is 

m„ (x J = r (x)^ D.Bt; - N-HlTuT^a^u. (A.l) 

where vector 1^ = {1, 1, 1} of length A^. The pilot estimators for all elements 
in the population is denoted by 

= {rfla {Xia)}i^UN = ("^ ^ A^""^lArl^) Tt/DaBc/, a = 1, d. 

For the second stage kernel smoothing, define the matrices 

'^Uia { ( -^ka -^ia ) ' "^^Uia diag {Kh (Xfca •^ia)} i^^ijj^ ■ 

Then the SBLL estimator of each component at Xj is given by 

m*^ = ej {Xjji^Wuia'^Uiay^ Xlji^WuiaYa, (A.2) 

where = y — j^ty'^N — collection of the pseudo-responses. The 

SBLL estimator of m (xj) based on the entire population is given by 

1 

rhi = j^ty + ^^*ia^ ieUN. (A.3) 

a=l 
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Clearly, m* is the prediction at Xj based on the entire finite population. If these 
m* were known, a design-unbiased estimator of ty would be 



— — TTj 

The proof of the asymptotic properties of ty,sBLL uses reasoning similar to 
that in in which a key step is the Taylor linearization. Recall that our 
proposed estimator involves two smoothing stages: spline smoothing in the first 
stage and kernel smoothing in the second stage. In the following, we establish 
the Taylor linearization for these two smoothing stages one by one. 

A. 2. Taylor Linearization at the First Stage 

Lemma A. 1. Under Assumptions (A1)-(A7), for any a = 1, ...,d, 

lim sup {rha {xa) — {xa)\ = Op { JAr(A^^^log A^)"*^^^} , 

where rha {xa) and fha {xq) are the pilot estimators given in / flT]) and M.ij) . 

Proof. Let S = N~^TJjTu and V = A^~^r^y be matrices with components 
^n' = ^'^T.keUM'^u,kj^u,kj' and Vj = N'^ Y^kaUM^u^kiVk, respectively. De- 
note = A^^^rjUsFs and = A^'^r^n^ys the sample versions of the ma- 
trices S and V with components s-^jj' = N^^J^kes ^s,kj'rs,kj'/ T^k and v-^j = 
J2kes ^s,kjyk/ TTfc. For each a = 1, ...,d and the spline basis F (x) in ([3]), let 

C (S^, V^; x^) = F (x)" {S-'Y^ - S-'V) (A.5) 

be a nonlinear function of {sn,jj'}i<j ji^q^ and {vnj}^^^ with respect to Xa- The 
difference (xa) — rha (xa) = C i^n, ^n'-, Xa) + Op{N~^^'^). Simple calculation 
shows that the first order derivatives of C, in flA.SP of s-^^jji and 

dC 

^ -- F(x)"D,(-S;iA,yS;i)V., l<j,f<G,, 



F(xf D„S;iA„ l<j<G,, 



where Xj is a G^-vector with "1" in the jth component and "0" elsewhere; 
and Ajj> is a x matrix with "1" in positions (j, j') and and "0" 

everywhere else. 
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Using the Taylor linearization, one can approximate ( in (IA.5|) by a linear 
one so that the difference between rha (xa) and rha (xa) can be decomposed as 

where for any 1 < j,j' < Gd, 

=r(x)^D„s-U,, 

= r(x)^D„ (S-^AjyS-^) V, 



^aj \Xct ) 



the remainder. Note that 



fcec/jv i=i ^ ' 



1 \yk 



and QaAT (xq) is 

Gd 

-^"^ XI XI '^"J' (^^'^J' ~ ^^'^j' 

fcGf/jv i=i 

Gd 

+A^"^ X X '^"^^ ^^'^'^'j ~ • 

keUN i=i 

By the discretization method given in Lemma A. 4 of 29|], the Borel-Cantelli 
Lemma entails that each single term in the right hand side of the above is of 
the order Op | Jjv(A^^^log A^)^/^}. Therefore, we have 

Gd 

X Vaj (Xa) {V^J - Vj] 



sup 



= Op{MN-HogNY/'}. 



Similar arguments lead to sup^.^g[Q^i] 



IS of the 



order Op {JNiN-HogNy/^} , and sup,^g[o,i] \QaN (x^)] = Op {JNiN-HogNy/^} 
Thus sup^.^g[o,i] C ('^'T, V,r; a^a) = Op { JAf(A^^"'^log A^)^/^}. The desired result is 
established. □ 

.3. Taylor Linearization at the Second Stage 



A 



Let 



tiaq / Kfi (Xfco- a^ia) (S'fca Xia 

k£s ^ 
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for g = 1, 2, 3 and 
t 



t 



T^k 



Vkct) 



for g = 4, 5. We rewrite m*^ in f lA.2p and m*^, in f lT^ by 



tialtiaS ^ia2 tin-\ tin?, 



Let Ziajs — ] 



9=^ a Af-iti, 



^iafcg, where ti„ = {tiaj^^i and 



^iakq 



Kh {Xka - Xia) {Xka " Xia^ Vka, foT ? = 4, 5. 



g-4, 



for q = 1,2, 3, 



Then one can approximate rh*^ — rfi*^ by a hnear sum, i.e. 



fcec/iv ^ ^ 



RiaNi 



where L^^at = X;q=i ^iaJVg with 



HaN2 



HaN3 



1 N dm*. 



1 dm*^ 



1 



iic=tic. ^^Um 
Ik 



.err., \ K / 



drh*n 



T^k 



(A.6) 



1 1 ^ {mp {xkp) - nifs (xkfs)} 



Ziak2 



h 



dm* 



and i^ioAT is the remainder. Similar to the proof of Lemma 3 in pj , 



UN 

N 



i&Un 



— - 1 ) {rhp {xkp) - mp {xkp)} 



(A.7) 
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Lemma A. 2. Under Assumptions (A1)-(A8), A^~^ Zliec/jv i^f^N) 0- 



Proof. By the Cauchy-Schwartz inequality, it suffices to show that for q = 
1,...,4, "^-^^^ Ep [L'^^j^g) — )■ 0. Without loss of generality, we only show 
the cases for q = 1 and 2. Similarly to the proof of Lemma 2 (v) in jlj, the 
ffist order derivatives of rh*^ with respect to N~Hiaq evaluated at tj = tj are 
uniformly bounded in i. So by Assumption (A7) 

]^ 5Z (^Livi) 



(ty tyj 



dm* 



< 



c 

iV5 



E 

j,k,l,pGUN 

Next 



^iajl^ialiykVp 



4 

< 



c_ 



IVkVpl 0. 



k,pGUN 



aN2 



iGC/jv 



9mt 



4 



By Lemma lA.ll A^~^ X^ief/jv i^'iaN2) ~^ 0) the lemma follows immedi- 
ately. □ 

A.4. Proofs of Theorems [H H and H 

Lemma A. 3. Under Assumptions (A1)-(A8), for the population and sample 
based SBLL estimators ofm{xia) given in M.3|) and 



lim —Ep 



0. 
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Proof. According to flA.6|) . one has 



En 

N " 



E 



Jp 2^ ZiakEp 



N 



1 ) {LiaN — RiaN ) 

T^k 



laN ~ RiaN 



Following from Lemma 4 in [l[ and Assumption (A7), the first term converges to 
zero as — )■ cxD. The third term also converges to zero by f lA.7p and Lemma [A .21 



/ ^ * ~ * \ 



By the Cauchy-Schwartz inequality, limTv^oo j^Ep [^^g^^ — m*^ ] = 
q; = 1, d. Note that 

1 2 f 

By Assumption (A. 7), 



Thus the desired result is obtained from the Cauchy-Schwartz inequality. □ 
Proof of Theorem [H Note that Ep [/J = tTj and 



^J/,SBLL 

N 



Then 



N 



< — 



E (y. - rK) {hl-n. - 1) 
ief/jv 



+ 



Ar2 





E -"^if 


Ep 









E (1 - 

.iGUn 



(A.9) 

1/2 
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According to Assumptions (A1)-(A6), limsupjy-^ ;^ ^j^f/j^ (z/j — < oo. 

Following the same arguments of Theorem 1 in the first term on the right 
of flA.9p converges to zero as — ?■ oo. For the second term, (A7) implies that 



Ep 



E TTj (1 - Hi) ^ 1 
Nnf - A' 



According to Lemma [A. 31 limAr_^oo ^ J2ieUN ~ ""^i)^] ~^ the re- 

sult follows from the Markov's inequality. □ 
The next theorem is to derive the asymptotic mean squared error of the 
proposed spline estimator in (|9]). 

Theorem A.l. Under Assumptions (A1)-(A8), 



Denote 



AMSE (Ar-H;_) = -1 A,^^ (A.11) 



Ar2 -'^ T,. 



the asymptotic mean squared error in (lA.lOp . The next result shows that it can 
be estimated consistently by V {N~^iy ^sbll) in ( |T5i) . 

Theorem A. 2. Under (A1)-(A8), 



lim nj^fEr, 



V (Ar-H;,sBLL) - AMSE (N-^ty,SBLL) 



0. 



The proofs of Theorems lA.ll and IA.2I are somewhat trivial and we refer the 
readers to 

Proof of Theorem [31 According to flA.Sp . 

N ~ N ^ N \Vi ~ 

From the proof of Theorem I A . ![ XliGt/jv jv'"' ~ '^v ^n^'^^ ■ Theorem 

El implies that V /AMSE {N~Hy^^^^^ 1 in probability. The 

desired result follows. □ 
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Proof of Theorem (4]. Let 

1/2 / T \ 1/2 



r, = ^E{*.--"W}(i-^).r. = ^E(*'-*.-)(i-7) 



Then nJ^N ^ (ij,,sBLL — ^y) can be represented as the sum of Ti, T2 and T3. For 
the first term, 



< 



By Theorem 2.1 in [29[, |m (xj) — m*| = Op {n logn), for any i G [/at, which 
implies that ETl 0. Now for T2 

By Lemma [A31 ETl ^ 0. Finally, 

,2 "^N sr^ 2 , X 1 - TTj 1 2 



^^2 = !^ y ^^(x,)^— ^ < ^- y 



limsup E'Tg < — limsup — cr^ (xj) < 00. 



By the Cauchy-Schwartz inequality the cross product terms go to zero as — )■ 
00. The desired result follows. □ 
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Table 1: Ratio of MSE of the HT, LREG, LS and SIM estimators to that of the SELL 

estimator and the average computing time of the SELL estimator based on 1000 replications 
of SRS samples from four fixed populations of size N = 1000. 



Model 


odiiipie biza 




MSE Ratio 






I I'M 


HT 






SIM 






50 


140.36 


0.89 


1.12 


1.60 


0.07 


0.1 


100 


148.03 


0.91 


1.07 


1.33 


0.07 


1 


200 


147.03 


0.92 


1.10 


1.02 


r\ f\f\ 

0.09 




50 


9.78 


0.92 


1.16 


1.24 


0.07 


0.4 


100 


10.50 


0.95 


1.10 


1.02 


0.07 




zUO 


10.4/ 


0.98 


1.05 


1 C\A 

1.U4 


0.09 




50 


134.05 


28.38 


2.11 


19.77 


0.07 


0.1 


100 


282.47 


58.10 


1.03 


36.58 


0.07 


2 


200 


313.93 


66.63 


0.98 


41.15 


0.09 




50 


18.45 


4.25 


2.36 


3.44 


0.07 


0.4 


100 


23.67 


5.34 


1.04 


3.69 


0.07 




200 


23.36 


5.63 


1.02 


3.92 


0.09 




50 


63.14 


30.83 


1.10 


37.12 


0.07 


0.1 


100 


103.33 


49.62 


1.01 


50.76 


0.07 


3 


200 


115.13 


56.57 


1.02 


57.04 


0.09 




50 


6.80 


3.46 


1.11 


3.93 


0.07 


0.4 


100 


8.18 


4.20 


1.14 


4.40 


0.07 




200 


18.39 


4.52 


1.09 


4.57 


0.09 




50 


55.81 


25.26 


1.01 


27.61 


0.07 


0.1 


100 


151.59 


62.63 


1.03 


65.78 


0.07 


4 


200 


230.44 


97.91 


0.97 


99.45 


0.09 




50 


9.97 


4.75 


1.03 


5.22 


0.07 


0.4 


100 


16.35 


7.10 


1.01 


7.44 


0.07 




200 


19.95 


8.60 


1.05 


8.74 


0.09 
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Table 2: Monte Carlo bias, standard error and the square root of the average estimated 
variances ([T5|) of the population total based on 1000 simulations. 



Model 0- 


n 


Bias 


SE 


Est. SE 




50 


-0.10 


14.69 


13.18 


0.1 


100 


-0.36 


9.85 


9.32 


1 


200 


-0.13 


6.55 


6.29 




50 


-1.62 


57.73 


51.81 


0.4 


100 


-1.55 


38.51 


36.77 




200 


-0.42 


25.71 


24.86 




50 


1.27 


24.49 


14.06 


0.1 


100 


0.62 


11.52 


9.13 




200 


0.37 


7.06 


6.10 


2 


50 


2.41 


67.66 


52.45 


0.4 


100 


-0.47 


40.94 


36.15 




200 


-0.13 


26.54 


24.33 




50 


2.29 


20.40 


13.38 


0.1 


100 


0.90 


10.91 


8.74 




200 


0.48 


6.82 


5.88 


3 


50 


2.17 


64.89 


50.86 


0.4 


100 


-0.04 


40.17 


35.44 




200 


0.32 


26.30 


23.99 




50 


-1.98 


29.04 


18.04 


0.1 


100 


-0.51 


12.28 


8.22 


4 


200 


-0.10 


6.38 


4.82 




50 


-4.38 


69.69 


43.31 


0.4 


100 


-1.18 


37.92 


27.72 




200 


-0.37 


22.58 


18.56 
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Table 3: Simulation results for axixiliary variable selection based on 100 replications of SRS 
samples from four fixed populations of size N = 1000. (Here the MSE Ratio is the ratio of 
MSE of the SBLL estimator calculated by using the selected model to the MSE of the oracle 
SBLL estimates computed by using the true model.) 



Model 




n 




Forward 






Backward 




c 


U 





MSE 
Katio 


c 


U 





MSE 
Katio 






50 


72 





28 


1.150 


73 





27 


1.124 




0.1 


100 


97 





3 


1.001 


97 





3 


1.001 


1 




200 


99 





1 


0.999 


99 





1 


0.999 




50 


76 





24 


1.147 


77 





23 


1.145 




0.4 


100 


98 





2 


1.002 


98 





2 


1.002 






200 


100 








1 AAA 

1.000 


1 AA 

100 








1 AAA 

1.000 






50 


87 





13 


1.255 


87 





13 


1.255 




0.1 


100 


96 





4 


1.012 


96 





4 


1.012 


2 




200 


100 








1.000 


100 








1 nnn 

1.000 




50 


79 





21 


1.019 


80 





20 


1.022 




0.4 


100 


98 





2 


1.000 


98 





2 


1.000 






200 


100 








1.000 


100 








1.000 






50 


87 





13 


1.082 


86 





14 


1.082 




0.1 


100 


91 





9 


1.000 


91 





9 


1.001 


3 




200 


100 








1.000 


100 








1.000 




50 


83 





17 


1.020 


83 





17 


1.020 




0.4 


100 


99 





1 


1.000 


99 





1 


1.000 






200 


100 








1.000 


100 








1.000 






50 


68 





32 


1.277 


69 





31 


1.277 




0.1 


100 


88 





12 


1.029 


88 





12 


1.029 


4 




200 


100 








1.000 


100 








1.000 




50 


69 





31 


1.063 


69 





31 


1.063 




0.4 


100 


97 





3 


1.000 


97 





3 


1.031 






200 


100 








1.000 


100 








1.000 
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